Note: This is a work in progress.
This R file walks through G. Grolemund and H. Wickham’s online text, “R for Data Science.” Much of the code is sourced directly from the book, and credit belongs to the authors. Here, some sections of code are heavily commented so that the beginning R programmer can read through, understand what each line of code does, and compare it to their own as they work through the text. Throughout, the book provides the primary and most thorough explanation. For the greatest learning benefit, I suggest you attempt each exercise on your own before looking at the code or write-ups provided here. Of course, there is more than one way to write code, and you may find a more elegant solution that you prefer.
For those new to R and RStudio, it may also help to knit the document and examine how the code in the Rmd file is rendered in the resulting knitted document. For example, see how `["R for Data Science."](http://r4ds.had.co.nz/index.html)` is rendered as a hyperlink in the preceding paragraph, where it was not surrounded by backticks, and compare that to how the same text is rendered in this paragraph when surrounded by backticks. See also the difference in appearance when knitting to different document types (HTML, PDF, Word).
Tip: If you are using RStudio, click the text next to the orange # box at the bottom of the editor window to easily navigate the code chunks.
Tip: Use the ? before any command to view the documentation on that function. Do this often. For example, type ?setwd to see a description, usage, arguments, and more for the function setwd().
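As a small sketch of the tip above, the lines below show a few ways to reach the same documentation from the console. `help()` and `args()` are base R functions; `setwd` is used only because it is the example named in the tip.

```r
# Three ways to look up a function from the console
?setwd          # shorthand for help("setwd")
help("setwd")   # the same lookup written as a function call
args(setwd)     # quick reminder of the function's arguments
```

In RStudio the first two open the Help pane; in a plain R session the help page is shown in the console or a pager.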
To really understand ggplot2, I highly recommend reading “The Layered Grammar of Graphics” as suggested at the beginning of Chapter 3.
mpg data frame
str(mpg) # Look at the structure of the mpg data frame
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
mpg # Look at the first 10 rows of the mpg data frame
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Hypothesis: There is a negative linear relationship between engine size and fuel efficiency, such that as engine size increases fuel efficiency decreases.
ggplot(data=mpg) + # specify data frame
geom_point(mapping = aes(x = displ, y = hwy)) # specify that plot is a scatterplot with displ on the x axis and hwy on the y axis
The plot supports the hypothesis: highway fuel efficiency (hwy) tends to decrease as engine size (displ) increases.
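As a rough numerical check on the direction of the relationship seen in the scatterplot, one could compute the Pearson correlation between the two variables. This is a sketch added here, not a step from the book, and it assumes a linear summary is a reasonable first look at these data.

```r
library(ggplot2)  # provides the mpg data frame

# Pearson correlation between engine displacement and highway mpg;
# a negative value agrees with the downward trend in the scatterplot
r <- cor(mpg$displ, mpg$hwy)
r
```

A correlation only summarizes the linear trend, so it complements the plot rather than replacing it.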
Template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
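To make the template concrete, here is one possible way to fill it in. The choice of `cty` and `hwy` as the mapped variables is only an illustration; any columns of `mpg` could be used.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y variables
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))
p  # printing the object draws the plot
```

Storing the plot in an object (here the hypothetical name `p`) is optional, but it makes it easy to add layers later.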
ggplot(data = mpg)
From str(mpg), we see that there are 234 rows and 11 columns in the mpg data frame.
# Alternative means of finding the number of rows and columns
nrow(mpg) # Print the number of rows
## [1] 234
ncol(mpg) # Print the number of columns
## [1] 11
There are 234 rows and 11 columns in the mpg data frame.
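Both counts can also be retrieved in a single call. The sketch below uses base R's `dim()`, which returns the number of rows followed by the number of columns.

```r
library(ggplot2)  # provides the mpg data frame

dim(mpg)  # rows first, then columns
# [1] 234  11
```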
The drv variable describes whether the vehicle is front-, rear-, or 4-wheel drive.
?mpg
ggplot(data=mpg) +
geom_point(mapping = aes(x=class, y=drv))
Test the hypothesis that the cars highlighted in red are hybrids by mapping car class to an aesthetic.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) # map class to the color aesthetic so that three variables are now distinguishable in the plot: engine displacement on the x axis, highway miles per gallon on the y axis, and car class by color.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue") # Set the aesthetic outside of aes() to manually assign it to all points
Aesthetic shapes:
The points are not blue because color = "blue" was placed inside aes().
# Problematic code
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
# Corrected code
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
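One way to see the difference between the two versions is to inspect the colours ggplot2 actually computes for the points. The sketch below uses `layer_data()` from ggplot2; the object names `p_mapped` and `p_set` are made up for this illustration.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data (a variable with one level),
# so the colour comes from the default colour scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): "blue" is applied literally to every point
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

unique(layer_data(p_mapped)$colour)  # a default palette colour, not "blue"
unique(layer_data(p_set)$colour)     # "blue"
```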
The variable types can be seen by printing mpg or by viewing the documentation with ?mpg. One may decide whether a variable is categorical or continuous by checking whether it is stored as a character, integer, or double (floating-point) value. However, this can lead to miscategorization in some cases. For example, while year is stored as an integer, it is typically treated as a discrete variable without a meaningful zero point, and is therefore not continuous.
The categorical variables are:
The continuous variables are:
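To see how each variable is stored, the column classes can be listed directly. This is only a sketch using base R's `sapply()`; as noted above, the storage type is a starting point, and deciding whether a variable is categorical or continuous remains a judgment call.

```r
library(ggplot2)  # provides the mpg data frame

# Storage class of every column in mpg
col_types <- sapply(mpg, class)
col_types

# Columns stored as character are natural candidates for categorical variables
names(col_types)[col_types == "character"]
```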
# Mapping a continuous variable to the shape aesthetic
ggplot(data=mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
# Mapping continuous variables to the color and size aesthetics
ggplot(data=mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cty))
# Mapping categorical variables to size, color, and shape
ggplot(data=mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = model, color = class, shape = drv))
# Mapping the same variable to multiple aesthetics
ggplot(data=mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty, size = cty)) # Here, cty (city miles per gallon) is mapped to both the size and color aesthetics
Tip: You can find documentation of available colors here.
?geom_point
# Example using `stroke`
ggplot(data=mpg)+
geom_point(mapping = aes(x=displ, y=hwy), shape = 21, colour = "darkgreen", fill = "gold", size = 5, stroke = 5) # `size` sets the size of the point and `stroke` sets the width of its outline (green); `fill` colors the inside (gold)
# Just for fun, let's write shorthand code to make the same plot
ggplot(mpg, aes(displ, hwy)) +
geom_point(shape = 21, colour = "darkgreen", fill = "gold", size = 5, stroke = 5)
Mapping the color aesthetic to displ < 5 will assign one color to all points with x-axis (displ) values < 5 and a different color to points with x-axis values \(\ge\) 5. Since the color palette is not specified, default colors are used.
ggplot(data=mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
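It may help to look at what the expression `displ < 5` produces on its own. The sketch below evaluates it outside of ggplot2; the name `small_engine` is a made-up variable for this illustration.

```r
library(ggplot2)  # provides the mpg data frame

# The comparison is evaluated once per row and returns TRUE or FALSE
small_engine <- mpg$displ < 5
head(small_engine)

# How many cars fall on each side of the cutoff
table(small_engine)
```

ggplot2 treats this logical result like a variable with two levels, which is why the plot shows exactly two colors.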
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2) # This will create a separate plot for each class of vehicle and will fit the plots into 2 rows
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl) # This will create a grid of plots with one plot for each combination of drv and cyl
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl) # Use the . to create plots for each level of cylinder (cyl) in the columns dimension. To facet in the rows dimension, use `facet_grid(cyl ~ .)`
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ hwy, nrow = 2)
The empty cells in the plot with facet_grid(drv ~ cyl) indicate that there are no cars at the intersection of that number of cylinders and that type of drive (e.g. no cars with 5 cylinders and 4-wheel drive). The absence of vehicles corresponding to specific cylinder-drive combinations is also evident in the second plot. Those intersections in the second plot without a point correspond to the empty cells in the first plot (see again cars with 5 cylinders on the y-axis and 4-wheel drive on the x-axis).
# First plot, with drive and cylinder faceted
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
# Second plot, with drive and cylinder represented on the axes of a single plot
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
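The same empty combinations can be confirmed without a plot by counting the cars in each drive-cylinder pair. This is a sketch using base R's `table()`; the name `combos` is made up for this illustration.

```r
library(ggplot2)  # provides the mpg data frame

# Count the cars in every drive type (rows) by cylinder (columns) combination
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

combos["4", "5"]  # 0: no 4-wheel-drive cars with 5 cylinders
```

Every zero in the table corresponds to an empty facet in the first plot and a missing point in the second.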
The first plot shows highway miles per gallon and engine displacement faceted by drive type. The . in the second position specifies that drive type should be displayed in rows. The second plot shows highway miles per gallon and engine displacement faceted by number of cylinders. The . in the first position specifies that number of cylinders should be displayed in columns.
# Plot of highway mpg and engine displacement faceted by drive type
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
# The above is the same as the following, except that the drive labels move from the right side to the top of each panel. Uncomment and run the code to see the difference.
#ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy)) +
# facet_wrap(~ drv, nrow = 3)
# Plot of highway mpg and engine displacement faceted by number of cylinders
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
# The above is the same as the following. Uncomment and run the code to see.
#ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy)) +
# facet_wrap(~ cyl, nrow = 1)
# Plot with facets
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
# Plot with color aesthetic
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
nrow - specifies the number of rows into which the faceted plots are fitted.
ncol - specifies the number of columns into which the faceted plots are fitted.
facet_grid() does not have nrow or ncol arguments because the number of rows and columns is determined by the number of levels of the row and column faceting variables.
?facet_wrap
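As a brief illustration of the `ncol` argument described above, the sketch below lays out the panels in four columns and lets `facet_wrap()` work out the number of rows. The names `n_classes` and `p` are made up for this example.

```r
library(ggplot2)

# mpg has 7 vehicle classes, so 4 columns need 2 rows to hold every panel
n_classes <- length(unique(mpg$class))
n_classes

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4)
p
```

Specifying either `nrow` or `ncol` is usually enough, since the other dimension follows from the number of panels.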